Monitoring, Observability & Infrastructure Interview Guide

Prometheus (25 Questions)

Core Concepts

What is Prometheus? Key features
What is the Prometheus architecture? Components (Prometheus Server, Pushgateway, Alertmanager, Exporters)
What is a time-series database? How does Prometheus store data?
What is the difference between monitoring and observability?
What are the four golden signals of monitoring? (Latency, Traffic, Errors, Saturation)

Metrics & Data Model

What are metrics in Prometheus? Types of metrics:
- Counter
- Gauge
- Histogram
- Summary
What is a metric label? How to use labels effectively?
What is cardinality? Why is high cardinality problematic?
What is the difference between histogram and summary?
When to use Counter vs Gauge?
What is metric naming convention in Prometheus?
What is the data retention period in Prometheus?
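
For the histogram vs summary question, it can help to show how a quantile is estimated from cumulative histogram buckets. The sketch below is a simplified illustration in the spirit of histogram_quantile, not Prometheus's actual implementation; the function name and bucket layout are assumptions made for this example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last bucket having upper_bound == math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the lowest bucket starts at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Cannot interpolate inside the +Inf bucket or an empty bucket.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical request-duration buckets in seconds.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
```

The estimate depends on the bucket boundaries, which is the main accuracy trade-off to mention when comparing histograms with summaries.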

PromQL (Prometheus Query Language)

What is PromQL? Basic query syntax
What is an instant vector vs range vector?
Common PromQL functions:
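- rate() - Per-second average rate of increase over a range (for counters)
- irate() - Instant per-second rate based on the last two samples
- increase() - Total increase over a range
- sum(), avg(), max(), min() - Aggregation operators
- by and without - Clauses that choose which labels an aggregation keeps or drops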
What is the difference between rate() and irate()?
How to calculate percentiles in PromQL? (histogram_quantile)
How to filter metrics by labels in PromQL?
What is the "up" metric (1 if the last scrape succeeded, 0 otherwise)? How to use it for health checks?
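
To make the rate() vs irate() vs increase() distinction concrete, here is a simplified sketch over raw counter samples. It handles counter resets but, unlike real Prometheus, does not extrapolate to the edges of the query window; the function names and sample data are hypothetical.

```python
def simple_increase(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (e.g. a restart),
        # so the new value is counted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase across the whole window."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return simple_increase(samples) / elapsed


def simple_irate(samples):
    """Per-second increase using only the last two samples."""
    return simple_rate(samples[-2:])


# Hypothetical scrape results; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 70)]
```

The averaged rate smooths over the window and suits alerting, while the instant rate reacts to the latest two samples and suits fast-moving graphs.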

Integration & Exporters

What is an exporter in Prometheus?
Common exporters:
- Node Exporter (system metrics)
- JMX Exporter (Java applications)
- Blackbox Exporter (endpoint monitoring)
- Custom exporters
How to integrate Prometheus with Spring Boot? (Micrometer, Actuator)
What is Pushgateway? When to use it?
What is service discovery in Prometheus? (static config, DNS, Kubernetes, Consul)

Alerting

What is Alertmanager? How does it work?
How to configure alerts in Prometheus?
What is alert routing and grouping?
What is silencing in Alertmanager?
What are best practices for alert thresholds?

Grafana (20 Questions)

Core Concepts

What is Grafana? Key features
What is the difference between Prometheus and Grafana?
What are Grafana data sources? (Prometheus, InfluxDB, Elasticsearch, CloudWatch, MySQL)
What is a dashboard in Grafana?
What is a panel? Types of panels (Graph, Gauge, Table, Heatmap, Stat)

Dashboards & Visualization

How to create a dashboard in Grafana?
What are dashboard variables? Use cases
What is templating in Grafana?
How to create dynamic dashboards?
What are dashboard annotations?
What is the difference between absolute time and relative time ranges?
How to share dashboards? (export JSON, snapshots, public dashboards)

Alerting

How to configure alerts in Grafana?
What are notification channels? (Email, Slack, PagerDuty, Webhook)
What is alert state? (Pending, Alerting, OK, No Data)
What is the difference between Grafana alerts and Prometheus alerts?

Advanced Features

What is Grafana Loki? How is it different from Elasticsearch?
What is Grafana Tempo? (distributed tracing)
What are Grafana plugins?
How to use Grafana with Kubernetes?
What are best practices for dashboard design?

ELK Stack (Elasticsearch, Logstash, Kibana) (30 Questions)

Elasticsearch

What is Elasticsearch? Core concepts
What is an index in Elasticsearch?
What is a document and document ID?
What is a shard? Primary shard vs replica shard
Why is sharding important in Elasticsearch?
What is an inverted index?
What is a mapping in Elasticsearch?
What are analyzers? Common analyzers (Standard, Whitespace, Keyword, Pattern)
What is the difference between text and keyword data types?
How does Elasticsearch achieve near real-time search?
What is a cluster, node, and index in Elasticsearch?
What is the difference between GET and SEARCH API?
What are query types in Elasticsearch?
- Match query
- Term query
- Range query
- Bool query
- Wildcard query
What is aggregation in Elasticsearch? (Bucket, Metric, Pipeline)
What is the difference between term query and match query?
How to perform full-text search in Elasticsearch?
What is scoring and relevance in Elasticsearch?
How to optimize Elasticsearch performance?
What is circuit breaker in Elasticsearch?
What is index lifecycle management (ILM)?
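
For the inverted index question, a toy sketch can show the core idea: map each term to the set of documents that contain it. This is an illustration only; real Elasticsearch analysis (tokenizers, filters, scoring) is far richer, and the function names here are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Very rough stand-in for an analyzer: lowercase + whitespace split.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "Connection timeout while calling payment service",
    2: "Payment accepted",
    3: "Timeout reading from cache",
}
inverted = build_inverted_index(documents)
```

Looking up a term is a dictionary access rather than a scan over all documents, which is why full-text search stays fast as the index grows.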

Logstash

What is Logstash? Architecture
What are the three stages of Logstash pipeline? (Input, Filter, Output)
Common Logstash input plugins (file, beats, kafka, jdbc)
Common Logstash filter plugins (grok, mutate, date, json, geoip)
Common Logstash output plugins (elasticsearch, file, kafka, stdout)
What is Grok pattern? How to use it?
What is the difference between Logstash and Filebeat?
How to handle log parsing errors in Logstash?
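
A Grok pattern is essentially a named regular expression. The sketch below uses Python named groups as a stand-in for a Grok expression such as %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}; the regex is a rough approximation, and tagging failures with _grokparsefailure mirrors what the grok filter does by default.

```python
import re

# Approximate Python equivalent of the Grok expression in the lead-in.
LINE_PATTERN = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<request>\S+) "
    r"(?P<bytes>\d+) "
    r"(?P<duration>\d+(?:\.\d+)?)"
)


def parse_line(line):
    """Turn a raw log line into a structured event, or tag it as failed."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        # Keep the raw message so failed lines can be routed and inspected.
        return {"message": line, "tags": ["_grokparsefailure"]}
    event = match.groupdict()
    event["bytes"] = int(event["bytes"])
    event["duration"] = float(event["duration"])
    return event


parsed = parse_line("55.3.244.1 GET /index.html 15824 0.043")
failed = parse_line("not a matching line")
```

Routing tagged events to a separate index or file is one common way to handle parsing errors without losing data.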

Kibana

What is Kibana? Key features
What is Discover in Kibana?
How to create visualizations in Kibana? (Bar, Line, Pie, Heatmap, Data Table)
What is a Kibana dashboard?
What are index patterns in Kibana?
What is Kibana Query Language (KQL)?
What is Kibana Lens?
How to create alerts in Kibana?
What is Canvas in Kibana?

ELK Stack Integration & Best Practices

What is the typical ELK stack workflow?
What are Beats? (Filebeat, Metricbeat, Packetbeat, Heartbeat, Auditbeat)
When to use Logstash vs Filebeat?
How to secure ELK stack? (authentication, encryption, role-based access)
What are best practices for log management?
How to handle large-scale log ingestion?
What is hot-warm-cold architecture in Elasticsearch?

Apache Kafka (35 Questions)

Core Concepts

What is Apache Kafka? Use cases
What is the Kafka architecture? Components:
- Broker
- Topic
- Partition
- Producer
- Consumer
- Zookeeper/KRaft
What is a topic in Kafka?
What is a partition? Why is partitioning important?
What is a broker in Kafka?
What is a Kafka cluster?
What is the role of Zookeeper in Kafka?
What is KRaft mode? Difference from Zookeeper
What is a message/record in Kafka? (Key, Value, Timestamp, Headers)

Producers

What is a Kafka producer?
How does a producer send messages to Kafka?
What is producer acknowledgment (acks)? (0, 1, all/-1)
What is idempotent producer?
What is the difference between sync and async send?
What is partitioner? How does producer choose partition? (key-based, round-robin, custom)
What is producer batching?
What are producer configuration parameters?
- batch.size
- linger.ms
- compression.type
- max.in.flight.requests.per.connection
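
For the partitioner question, a minimal sketch of key-based partition selection follows. Kafka's default partitioner hashes keys with murmur2; zlib.crc32 is used here only as a readily available stand-in, and the round-robin fallback for keyless records is a simplification (newer clients use sticky partitioning).

```python
import zlib


def choose_partition(key, num_partitions, round_robin_counter=0):
    """Pick a partition for a record.

    key: bytes or None. round_robin_counter is a hypothetical per-producer
    counter used only when the record has no key.
    """
    if key is None:
        return round_robin_counter % num_partitions
    # Same key -> same hash -> same partition, which preserves per-key order.
    return zlib.crc32(key) % num_partitions


partition = choose_partition(b"order-42", 6)
```

Because the result depends on num_partitions, increasing the partition count changes where existing keys land, which is worth raising when discussing scaling.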

Consumers

What is a Kafka consumer?
What is a consumer group?
How does Kafka achieve load balancing among consumers?
What is consumer offset?
What is offset commit? Auto-commit vs manual commit
What happens when a consumer fails? (rebalancing)
What is consumer lag? How to monitor it?
What is the difference between poll() and subscribe()?
What is enable.auto.commit?
What are consumer configuration parameters?
- group.id
- auto.offset.reset (earliest, latest, none)
- max.poll.records
- session.timeout.ms
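
Consumer lag is the gap between the newest offset in each partition and the offset the consumer group has committed. A small sketch of that arithmetic, with hypothetical offsets and function name:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag for one consumer group.

    log_end_offsets: partition -> next offset to be written by producers.
    committed_offsets: partition -> next offset the group will read.
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Treating a missing commit as offset 0 is an assumption of this sketch.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


lag_by_partition = consumer_lag({0: 1500, 1: 1200, 2: 900}, {0: 1500, 1: 1100})
total_lag = sum(lag_by_partition.values())
```

Lag that keeps growing over time, rather than a single high reading, is usually the signal worth alerting on.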

Replication & Fault Tolerance

What is replication in Kafka?
What is replication factor?
What is leader and follower?
What is ISR (In-Sync Replica)?
How does Kafka ensure message durability?
What is min.insync.replicas?
What happens when a broker fails?
What is unclean leader election?

Performance & Scalability

How does Kafka achieve high throughput?
What is log compaction?
What is retention policy in Kafka? (time-based, size-based)
How to scale Kafka? (add brokers, increase partitions)
What is the relationship between partitions and parallelism?
What are Kafka performance tuning tips?
What is the difference between Kafka and traditional message queues? (RabbitMQ, ActiveMQ)

Kafka Streams & Connect

What is Kafka Streams?
What is Kafka Connect?
What are source and sink connectors?
When to use Kafka Streams vs Kafka Connect?

Monitoring & Operations

How to monitor Kafka? (JMX metrics, Kafka Manager, Burrow)
Important Kafka metrics to monitor:
- UnderReplicatedPartitions
- OfflinePartitionsCount
- ActiveControllerCount
- RequestHandlerAvgIdlePercent
What is Kafka MirrorMaker?

Redis (30 Questions)

Core Concepts

What is Redis? Key features
What makes Redis fast? (in-memory, single-threaded, efficient data structures)
What are Redis data types?
- String
- List
- Set
- Sorted Set
- Hash
- Bitmap
- HyperLogLog
- Stream
What is Redis use case? (caching, session storage, rate limiting, real-time analytics)
What is the difference between Redis and Memcached?
Is Redis single-threaded? How does it handle concurrent requests?

Data Structures & Commands

Important String commands (SET, GET, INCR, DECR, MSET, MGET)
Important List commands (LPUSH, RPUSH, LPOP, RPOP, LRANGE)
Important Set commands (SADD, SMEMBERS, SINTER, SUNION, SDIFF)
Important Sorted Set commands (ZADD, ZRANGE, ZRANK, ZINCRBY)
Important Hash commands (HSET, HGET, HGETALL, HINCRBY)
What is the time complexity of common Redis operations?
What is SCAN command? Difference from KEYS
What are Redis transactions? (MULTI, EXEC, DISCARD, WATCH)

Persistence

What are Redis persistence mechanisms?
- RDB (Redis Database, point-in-time snapshots)
- AOF (Append-Only File)
What is the difference between RDB and AOF?
When to use RDB vs AOF?
What is hybrid persistence (RDB+AOF)?
What is snapshotting in Redis?

Caching Strategies

What are caching strategies?
- Cache-Aside (Lazy Loading)
- Write-Through
- Write-Behind (Write-Back)
- Read-Through
What is cache eviction policy? (LRU, LFU, FIFO, Random, TTL)
What is TTL (Time To Live)?
How to handle cache stampede?
What is cache penetration, cache breakdown, and cache avalanche?
How to implement distributed locking in Redis? (SETNX, RedLock algorithm)
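
The cache-aside strategy with TTL can be sketched with an in-memory dictionary standing in for Redis. The class, its loader callback, and the injectable clock are assumptions for illustration; a real setup would call a Redis client and add stampede protection.

```python
import time


class CacheAside:
    """Cache-aside (lazy loading): read the cache first, load on a miss."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader          # fetches from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: load from the backing store and repopulate.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key after the underlying data changes."""
        self._entries.pop(key, None)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The hit rate computed here is the same ratio asked about in the monitoring questions: hits divided by hits plus misses.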

High Availability & Scalability

What is Redis Sentinel? How does it work?
What is Redis Cluster? How does it achieve scalability?
What is the difference between Redis Sentinel and Redis Cluster?
How does Redis Cluster handle data sharding?
What is hash slot in Redis Cluster?
What is split-brain problem in Redis?
What is Redis replication? Master-slave architecture
How to handle failover in Redis?
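
For the hash slot question: Redis Cluster maps each key to one of 16384 slots using CRC16 of the key modulo 16384, and hashes only the part inside {...} when a non-empty hash tag is present. The sketch below relies on Python's binascii.crc_hqx, which uses the same CRC-CCITT polynomial (0x1021) with an initial value of 0; the function name is an assumption.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Compute the Redis Cluster hash slot for a key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


# Keys sharing a hash tag land in the same slot, so multi-key commands work.
same_slot = key_slot(b"user:{42}:profile") == key_slot(b"user:{42}:orders")
```

Hash tags are the usual answer to how multi-key operations or transactions can be kept on one node in a cluster.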

Performance & Monitoring

How to monitor Redis? (INFO command, redis-cli, monitoring tools)
Important Redis metrics:
- Memory usage
- Hit rate
- Connected clients
- Commands processed per second
- Evicted keys
How to optimize Redis performance?
What is pipelining in Redis?
What is Redis pub/sub?
What are Redis Streams? Use cases
What is the maximum size of a Redis key/value?

CDN (Content Delivery Network) (20 Questions)

Core Concepts

What is CDN? How does it work?
What are the benefits of using CDN? (reduced latency, improved performance, DDoS protection, reduced bandwidth cost)
What is edge server/edge location?
What is origin server?
What is Point of Presence (PoP)?
How does CDN routing work? (DNS-based routing, Anycast)

CDN Types & Architecture

What are types of CDN?
- Push CDN
- Pull CDN
What is the difference between push and pull CDN?
What is CDN caching? Cache hierarchy
What is cache hit ratio?
What is Time To Live (TTL) in CDN?
What is cache invalidation/purging?

CDN Features

What is edge computing?
What is CDN load balancing?
How does CDN handle dynamic content?
What is CDN SSL/TLS termination?
What is geo-blocking in CDN?
What are CDN security features? (DDoS protection, WAF, bot protection)
What is image optimization in CDN?
What is compression in CDN? (Gzip, Brotli)

Popular CDN Providers

Popular CDN providers (Cloudflare, Akamai, AWS CloudFront, Fastly, Azure CDN)
What is Cloudflare? Key features
What is AWS CloudFront?
How to integrate CDN with your application?

Spring Boot Actuator (25 Questions)

Core Concepts

What is Spring Boot Actuator?
How to enable Actuator in Spring Boot?
What are Actuator endpoints?
What is the difference between web endpoints and JMX endpoints?

Built-in Endpoints

Important Actuator endpoints:
- /actuator/health - Application health
- /actuator/info - Application information
- /actuator/metrics - Application metrics
- /actuator/env - Environment properties
- /actuator/beans - Spring beans
- /actuator/mappings - Request mappings
- /actuator/loggers - Logger configuration
- /actuator/threaddump - Thread dump
- /actuator/heapdump - Heap dump
- /actuator/prometheus - Prometheus metrics
What is health indicator? (disk space, database, Redis, custom)
How to create custom health indicators?
What is health status? (UP, DOWN, OUT_OF_SERVICE, UNKNOWN)
How to expose/hide specific endpoints?
What is the /info endpoint? How to add custom info?

Metrics

What metrics are available in Actuator?
- JVM metrics (memory, threads, GC)
- HTTP metrics (request count, response time)
- Database metrics (connection pool)
- Custom metrics
How to create custom metrics? (MeterRegistry, Counter, Gauge, Timer)
What is Micrometer?
How to integrate Actuator with Prometheus?
What is dimensional metrics?

Security & Configuration

How to secure Actuator endpoints?
What is the role of Spring Security with Actuator?
How to configure endpoint exposure? (management.endpoints.web.exposure.include)
What is base path for Actuator? (management.endpoints.web.base-path)
How to customize Actuator endpoints?
What is @Endpoint annotation?

Advanced Features

How to create custom Actuator endpoints?
What is /auditevents endpoint?
How to monitor application performance using Actuator?
How to integrate Actuator with external monitoring systems? (Grafana, Prometheus, ELK)

Application Performance Monitoring (APM) (20 Questions)

Core Concepts

What is APM? Why is it important?
What is distributed tracing?
What is a trace, span, and trace ID?
What is observability vs monitoring?
Three pillars of observability (Logs, Metrics, Traces)

APM Tools

Popular APM tools:
- New Relic
- Datadog
- AppDynamics
- Dynatrace
- Elastic APM
- Jaeger
- Zipkin
What is Zipkin? How does it work?
What is Jaeger? Architecture
What is Spring Cloud Sleuth?
How to implement distributed tracing in Spring Boot? (Sleuth + Zipkin)

Metrics & Monitoring

What is Apdex score?
What is response time percentiles? (p50, p95, p99)
What is throughput and latency?
What is error rate?
How to monitor database query performance?
How to identify performance bottlenecks?
What is transaction tracing?
What is Real User Monitoring (RUM)?
What is Synthetic Monitoring?
What is the difference between RUM and Synthetic Monitoring?
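
The Apdex score can be worked through in a few lines. With a target threshold T, requests at or below T count as satisfied, requests up to 4T count as tolerating (half weight), and the rest are frustrated. The sample data and function name below are hypothetical.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        return None
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Response times in seconds, with T = 0.5 s:
# 5 satisfied, 3 tolerating, 2 frustrated.
samples = [0.2, 0.3, 0.4, 0.6, 1.0, 1.5, 2.5, 0.1, 0.45, 3.0]
score = apdex(samples, threshold=0.5)
```

A score of 1.0 means every request was satisfied; the choice of T is a product decision and should be stated alongside the score.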

API Monitoring & Management (20 Questions)

API Monitoring

What is API monitoring? Why is it important?
What metrics should be monitored for APIs?
- Response time
- Error rate
- Request rate
- Availability/Uptime
- Latency
How to monitor API endpoints? (health checks, synthetic monitoring)
What is API uptime monitoring?
What are API monitoring tools? (Postman, Runscope, Pingdom, Uptime Robot)
How to implement API health checks in Spring Boot?

API Gateway Monitoring

What is API Gateway?
What metrics to monitor in API Gateway?
How to monitor API Gateway performance?
What is rate limiting in API Gateway?
How to implement throttling?

API Logging & Analytics

What should be logged for APIs?
- Request/Response
- Headers
- Timestamps
- User information
- Errors
How to implement structured logging for APIs?
What is API analytics?
How to track API usage patterns?
What is request tracing?
How to correlate logs across microservices? (correlation ID)
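
Structured logging with a correlation ID can be sketched using the standard logging module. The field names (service, correlation_id) and the formatter class are assumptions for illustration; in Spring Boot the same idea is usually implemented with MDC and a JSON encoder.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # These attributes arrive via `extra=` on the logging call.
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name):
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as logger.info("order created", extra={"service": "orders", "correlation_id": request_id}) then produces a line that can be filtered by correlation_id across services, provided every service forwards the same ID in its outgoing requests.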

API Security Monitoring

How to monitor API security?
What are common API security threats? (DDoS, SQL injection, unauthorized access)
How to detect API abuse?
What is anomaly detection in API monitoring?

Log Management & Best Practices (20 Questions)

Logging Fundamentals

What are log levels? (TRACE, DEBUG, INFO, WARN, ERROR, FATAL)
When to use each log level?
What is structured logging?
What is the difference between structured and unstructured logs?
What is JSON logging? Benefits
What should be included in log messages?
- Timestamp
- Log level
- Service name
- Correlation ID
- User ID
- Error details

Logging Frameworks

Popular Java logging frameworks:
- Logback
- Log4j2
- SLF4J (Simple Logging Facade)
What is SLF4J? Why use it?
What is the difference between Log4j, Log4j2, and Logback?
How to configure logging in Spring Boot?
What is logging pattern/layout?

Log Aggregation

What is log aggregation? Why is it important?
What is centralized logging?
How to implement centralized logging in microservices?
What is log retention policy?
How to handle log rotation?
What is log sampling?

Best Practices

What are logging best practices?
- Use appropriate log levels
- Include context information
- Avoid logging sensitive data
- Use correlation IDs
- Implement log sampling for high-volume systems
How to avoid logging sensitive information? (passwords, credit cards, PII)
How to optimize log storage costs?
What is log enrichment?
How to search and analyze logs efficiently?

Alerting & Incident Management (20 Questions)

Alerting Fundamentals

What is alerting? Why is it important?
What makes a good alert?
What is alert fatigue? How to prevent it?
What is the difference between alert and notification?
What are alert severity levels? (Critical, High, Medium, Low)

Alert Types

What are different types of alerts?
- Threshold-based alerts
- Anomaly detection alerts
- Composite alerts
What is threshold alerting?
What is anomaly-based alerting?
What is alert aggregation?
What is alert deduplication?

Alert Configuration

What factors to consider when setting alert thresholds?
What is alert hysteresis?
What is alert flapping? How to prevent it?
What is alert routing?
What is on-call rotation?
How to prioritize alerts?
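
Alert hysteresis and flapping prevention can be illustrated with two thresholds: fire above one value, clear only below a lower one. The class and the threshold values are hypothetical.

```python
class HysteresisAlert:
    """Fire above `fire_above`; clear only once below `clear_below`."""

    def __init__(self, fire_above, clear_below):
        if clear_below > fire_above:
            raise ValueError("clear_below must not exceed fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def observe(self, value):
        """Feed one measurement and return the current alert state."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


# CPU percentages hovering around 80% would flap with a single threshold.
alert = HysteresisAlert(fire_above=80, clear_below=75)
states = [alert.observe(v) for v in [70, 85, 78, 82, 74, 79]]
```

With a single 80% threshold the same series would change state four times; the gap between the two thresholds absorbs the oscillation.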

Incident Management

What is incident management process?
What is incident severity classification?
What is MTTR (Mean Time To Repair)?
What is MTTD (Mean Time To Detect)?
What is MTTA (Mean Time To Acknowledge)?
What are incident management tools? (PagerDuty, Opsgenie, VictorOps)
What is incident postmortem? Why is it important?
What is runbook/playbook?

Scenario-Based Questions (40 Questions)

Performance Issues

Your application response time suddenly increased. How would you troubleshoot?
How would you identify if the issue is in application, database, or network?
CPU usage is at 100%. How would you investigate?
Memory usage is continuously growing. How would you detect memory leaks?
Database queries are slow. How would you optimize?
How would you handle a sudden spike in traffic?
Application is timing out. How would you debug?

Monitoring & Observability

How would you set up monitoring for a new microservice?
What metrics would you monitor for a REST API?
How would you monitor database performance?
How would you implement distributed tracing across 10 microservices?
How would you correlate logs across multiple services?
How would you monitor Kafka consumer lag?
How would you detect if a microservice is down?
How would you monitor Redis cache hit rate?

Alerting & Incident Response

You received an alert about high error rate. What steps would you take?
How would you configure alerts to avoid false positives?
Multiple alerts are firing. How would you prioritize?
A critical service is down at 3 AM. Walk through your incident response process
How would you implement on-call rotation for your team?
How would you conduct a postmortem after an incident?

ELK Stack Scenarios

Elasticsearch cluster is slow. How would you optimize?
How would you handle log ingestion of 1TB/day?
Elasticsearch nodes are running out of memory. What would you do?
How would you search for specific error messages across millions of logs?
How would you implement log retention policy for cost optimization?
Kibana dashboards are loading slowly. How would you troubleshoot?

Kafka Scenarios

Kafka consumer lag is increasing. How would you address it?
A Kafka broker went down. What happens?
How would you handle Kafka rebalancing issues?
Messages are being duplicated. How would you ensure exactly-once delivery?
How would you migrate Kafka cluster without downtime?
How would you scale Kafka to handle 10x traffic?

Redis Scenarios

Redis is running out of memory. What would you do?
Cache hit rate is very low. How would you improve it?
How would you handle cache stampede during peak traffic?
Redis master went down. How does failover work?
How would you implement rate limiting using Redis?
How would you migrate from single Redis instance to Redis Cluster?

Prometheus & Grafana Scenarios

Prometheus is consuming too much storage. How would you optimize?
How would you monitor multiple Kubernetes clusters with Prometheus?
Grafana dashboard is not showing recent data. What could be wrong?
How would you create a dashboard for database performance monitoring?
How would you set up alerts for API latency > 500ms?

CDN & API Gateway Scenarios

CDN cache hit rate is low. How would you improve it?
Static assets are not being cached. How would you debug?
How would you handle CDN cache invalidation for a critical update?
API Gateway is becoming a bottleneck. How would you scale?
How would you implement rate limiting at API Gateway level?

System Design with Monitoring

Design a monitoring system for an e-commerce application
How would you monitor a microservices-based system with 50+ services?
Design an alerting strategy for a payment processing system
How would you implement observability in a serverless architecture?
Design a logging strategy for a multi-region deployment

Best Practices & Guidelines (25 Questions)

Monitoring Best Practices

What are the key principles of effective monitoring?
- Monitor what matters
- Keep it simple
- Avoid alert fatigue
- Use meaningful metrics
What is the USE method? (Utilization, Saturation, Errors)
What is the RED method? (Rate, Errors, Duration)
What metrics should be monitored at different layers?
- Application layer
- Infrastructure layer
- Network layer
- Database layer
How to establish SLO (Service Level Objectives)?
What is the difference between SLI, SLO, and SLA?

Logging Best Practices

What are logging best practices in microservices?
How to implement correlation across distributed systems?
What should never be logged? (passwords, tokens, PII, credit cards)
How to balance between detailed logging and performance?
What is the cost of excessive logging?

Alerting Best Practices

What makes an actionable alert?
How many alerts are too many?
What is alert-to-noise ratio?
Should you alert on symptoms or causes?
What is the difference between alerts and notifications?

Performance Best Practices

What are performance monitoring best practices?
How to establish baseline metrics?
What is capacity planning?
How to perform load testing with monitoring?
What is chaos engineering? How does monitoring help?

Security & Compliance

How to ensure sensitive data is not exposed in logs?
What are compliance requirements for log retention? (GDPR, HIPAA)
How to implement audit logging?
How to secure monitoring endpoints?
What access controls should be in place for monitoring systems?

Tools Comparison (10 Questions)

Prometheus vs InfluxDB - When to use which?

Prometheus: Pull-based, optimized for metrics/monitoring, strong alerting, better for Kubernetes, PromQL
InfluxDB: Push-based, general-purpose time-series DB, better for IoT/sensor data, InfluxQL/Flux, built-in data retention
Use Prometheus for infrastructure monitoring, InfluxDB for application analytics

ELK Stack vs Splunk - Pros and cons

ELK Stack:
- Pros: Open-source, cost-effective, flexible, large community
- Cons: Complex setup, resource-intensive, requires maintenance
Splunk:
- Pros: Enterprise features, powerful analytics, better support, easier setup
- Cons: Expensive licensing, cost scales with data volume
Use ELK for cost-sensitive projects, Splunk for enterprise with budget

Grafana vs Kibana - Key differences

Grafana: Multi-source visualization, better for metrics/time-series, cleaner dashboards, alerting
Kibana: Tightly integrated with Elasticsearch, better for logs, built-in analytics, Elastic ecosystem
Use Grafana for metrics dashboards, Kibana for log analysis

Kafka vs RabbitMQ - Use cases

Kafka:
- High throughput, distributed streaming, log aggregation, event sourcing
- Durable, replay capability, horizontal scaling
RabbitMQ:
- Traditional message queue, complex routing, low latency, easier setup
- Better for request-reply patterns
Use Kafka for event streaming/big data, RabbitMQ for traditional messaging

Redis vs Memcached - Key differences

Redis:
- Multiple data structures, persistence, pub/sub, clustering, Lua scripting
- Single-threaded command execution, feature-rich
Memcached:
- Simple key-value only, multi-threaded, no persistence
- Slightly faster for simple caching
Use Redis for complex use cases, Memcached for simple distributed caching

Logstash vs Fluentd - Comparison

Logstash:
- Elastic ecosystem, rich plugins, Grok patterns, JVM-based (resource heavy)
Fluentd:
- Lightweight (Ruby/C), better performance, Cloud Native, CNCF project
- JSON native, unified logging layer
Use Logstash with ELK Stack, Fluentd for cloud-native/Kubernetes

Jaeger vs Zipkin - Distributed tracing comparison

Jaeger:
- Uber-developed, CNCF project, better for Kubernetes
- Adaptive sampling, hot-path support
Zipkin:
- Twitter-developed, simpler setup, more mature
- Better documentation, wider adoption
Both are good choices; choose based on ecosystem fit

New Relic vs Datadog - APM comparison

New Relic:
- Strong APM focus, easier learning curve, better for application monitoring
- Usage-based pricing (per user plus data ingested)
Datadog:
- Better infrastructure monitoring, more integrations, real-time analytics
- Primarily per-host pricing with extra charges for custom metrics, can be expensive
Choose based on primary use case (application vs infrastructure focus)

CloudWatch vs Prometheus for AWS - When to use which?

CloudWatch:
- Native AWS integration, no setup required, managed service
- Limited retention, AWS-specific
Prometheus:
- Open-source, flexible, better query language, cross-cloud
- Self-managed, requires setup
Use CloudWatch for AWS-only, Prometheus for multi-cloud/detailed metrics

Sentry vs ELK for error tracking - Comparison

Sentry:
- Specialized error tracking, better error grouping, release tracking
- Developer-friendly, issue assignment
ELK:
- General-purpose logging, full-text search, broader use cases
- More complex but more flexible
Use Sentry for application error tracking, ELK for comprehensive logging

Additional Advanced Topics (15 Questions)

Observability as Code

What is Observability as Code?

Defining monitoring, logging, and alerting configuration as code
Version control, peer review, automated deployment
Infrastructure as Code for observability

What are benefits of GitOps for monitoring?

Version control for dashboards and alerts
Reproducible environments
Easy rollback and audit trail

Service Mesh Observability

What is service mesh? How does it help observability?

Istio, Linkerd, Consul Connect
Automatic distributed tracing
Standardized metrics collection
Traffic visibility without code changes

What metrics does service mesh provide?

Request success rates
Latency distribution
Service dependencies
Circuit breaker stats

Cost Optimization

How to optimize monitoring costs?

Sampling high-volume metrics
Data retention policies
Log level filtering
Metric aggregation
Use tiered storage (hot/warm/cold)

What is metric cardinality explosion? How to prevent it?

Too many unique label combinations
Increases storage and query costs
Prevention: Limit label values, avoid unbounded labels, use label guidelines
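
The cardinality explosion is easy to show with arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names and counts below are hypothetical.

```python
from math import prod


def series_count(label_value_counts):
    """Worst-case series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())


bounded = {"method": 5, "status": 10, "endpoint": 50}
# Adding an unbounded label such as a user ID multiplies everything.
unbounded = {**bounded, "user_id": 100_000}

bounded_series = series_count(bounded)
unbounded_series = series_count(unbounded)
```

One unbounded label turns a few thousand series into hundreds of millions, which is why IDs, emails, and raw URLs are usually kept out of labels.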

Modern Observability Patterns

What is OpenTelemetry?

Vendor-neutral observability framework
Unified APIs for traces, metrics, logs
Auto-instrumentation support
CNCF project

What is eBPF in observability?

Extended Berkeley Packet Filter
Kernel-level observability without application code changes
Low overhead monitoring
Tools: Pixie, Cilium

What is continuous profiling?

Always-on performance profiling
Production-safe profiling
Identify performance regressions
Tools: Pyroscope, Parca

SRE & Reliability

What is SRE (Site Reliability Engineering)?

Applies software engineering to operations
Focus on reliability, scalability, automation
Error budgets and SLOs

What is error budget?

Acceptable downtime based on SLO
Balance between reliability and feature velocity
Example: 99.9% uptime allows about 43 minutes of downtime per 30-day month
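
The error budget arithmetic can be checked directly. This sketch assumes a 30-day period and a time-based (availability) SLO; the function names are hypothetical.

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo, downtime_minutes, period_days=30):
    """Fraction of the budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget


three_nines = error_budget_minutes(0.999)    # about 43.2 minutes
four_nines = error_budget_minutes(0.9999)    # about 4.32 minutes
```

Each extra nine cuts the budget by a factor of ten, which is the usual argument for not setting SLOs higher than users actually need.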

What are the four golden signals of SRE?

Latency: Time to serve requests
Traffic: Demand on system
Errors: Rate of failed requests
Saturation: Resource utilization

Cloud-Native Monitoring

How to monitor Kubernetes clusters?

Prometheus Operator
kube-state-metrics
Node exporter
cAdvisor for container metrics
Grafana dashboards

What is container monitoring? Key metrics

CPU and memory usage per container
Container restart count
Network I/O
Disk I/O
Tools: cAdvisor, Datadog, New Relic

How to monitor serverless applications?

Cold start duration
Invocation count and errors
Duration and memory usage
CloudWatch for AWS Lambda
Distributed tracing challenges

Real-World Integration Patterns (10 Questions)

How to integrate Prometheus with Spring Boot microservices?

Add Micrometer dependency
Enable Actuator with Prometheus endpoint
Configure Prometheus scraping
Create Grafana dashboards

How to set up centralized logging for microservices?

Filebeat on each service → Logstash → Elasticsearch → Kibana
Add correlation ID to all logs
Structured JSON logging
Log aggregation pattern

How to implement health checks across microservices?

Liveness probes (is service running?)
Readiness probes (can service handle traffic?)
Custom health indicators
Aggregate health status
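
Aggregating health status usually means reporting the most severe component status. The sketch below assumes the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, modeled on what Spring Boot Actuator appears to use by default; the function and component names are hypothetical.

```python
# Most severe first; assumed order for this sketch.
SEVERITY_ORDER = ["DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN"]


def aggregate_health(component_statuses):
    """component_statuses: dict of component name -> status string."""
    if not component_statuses:
        return "UNKNOWN"
    # The overall status is the status that appears earliest in the order.
    return min(component_statuses.values(), key=SEVERITY_ORDER.index)


overall = aggregate_health({"db": "UP", "redis": "DOWN", "diskSpace": "UP"})
```

For readiness probes, only the components that decide whether traffic can be served should feed the aggregate, otherwise an optional dependency can take the whole service out of rotation.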

How to monitor API Gateway (Kong/AWS API Gateway)?

Request/response metrics
Rate limiting metrics
Authentication success/failure
Backend service health
Integration with Prometheus/CloudWatch

How to integrate Kafka with monitoring systems?

JMX Exporter for Prometheus
Consumer lag monitoring (Burrow)
Kafka Manager/AKHQ for UI
Alert on lag, under-replicated partitions

How to monitor database connections in Spring Boot?

HikariCP metrics via Actuator
Monitor active, idle, pending connections
Connection pool saturation alerts
Query performance with slow query logs

How to implement circuit breaker monitoring?

Resilience4j with Micrometer
Monitor state transitions (closed/open/half-open)
Success/failure rates
Visualize in Grafana
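
The state transitions being monitored can be sketched as a small state machine. This is a simplified consecutive-failure breaker, not Resilience4j's sliding-window implementation; the class, thresholds, and the transitions list (standing in for a metrics counter) are assumptions.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive failures -> half_open after a timeout."""

    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"
        self.transitions = []  # what a metrics exporter would count

    def _set_state(self, new_state):
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def allow_request(self):
        # After the timeout, let a trial request through in half_open state.
        if self.state == "open" and self._clock() - self._opened_at >= self._reset_timeout:
            self._set_state("half_open")
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._set_state("closed")

    def record_failure(self):
        self._failures += 1
        # A failed trial request, or too many failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
            self._set_state("open")
```

Counting each (from, to) pair as a labeled counter gives the transition graph typically shown on a dashboard, with an alert on time spent in the open state.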

How to trace requests across API Gateway → Microservices → Database?

Spring Cloud Sleuth for trace ID generation
Propagate trace context in HTTP headers
Zipkin/Jaeger for trace collection
Visualize complete request flow

How to implement custom business metrics?

MeterRegistry in Spring Boot
Counter for events (orders, signups)
Timer for operations
Gauge for current state
Export to Prometheus

How to monitor scheduled jobs/batch processes?

Job execution time
Success/failure rate
Last successful run timestamp
Dead letter queue monitoring
Alert on job failures

Troubleshooting Checklist (10 Questions)

Application is slow - Where to start?
Check application metrics (response time, throughput)
Review recent deployments/changes
Check resource utilization (CPU, memory, disk)
Analyze slow queries in database
Check external service dependencies
Review logs for errors/warnings
High memory usage - Investigation steps
Take heap dump (jmap, Actuator)
Analyze with tools (MAT, VisualVM)
Check for memory leaks
Review garbage collection logs
Check cache sizes
Monitor memory growth over time
Database queries timing out - Debug approach
Enable slow query log
Check query execution plans (EXPLAIN)
Look for missing indexes
Check database connection pool
Review lock contention
Check database resource utilization
Microservice not responding - Troubleshooting
Check health endpoint
Review application logs
Check resource limits (CPU, memory)
Verify network connectivity
Check dependent services
Review recent deployments
Redis cache misses increasing - Investigation
Check cache hit/miss ratio
Verify TTL settings
Check memory usage and eviction
Review cache key patterns
Look for cache invalidation issues
Check client connection issues
Kafka consumer lag growing - Resolution steps
Check consumer processing time
Verify partition assignment
Scale consumers (add instances)
Optimize consumer batch size
Check for slow downstream dependencies
Review consumer configuration
Elasticsearch cluster yellow/red status - Fix
- Check unassigned shards
- Verify replica settings
- Check disk space on nodes
- Review cluster allocation settings
- Check for node failures
- Rebalance shards if needed
Prometheus scraping failures - Troubleshooting
- Verify target is reachable
- Check firewall/network rules
- Verify metrics endpoint is exposed
- Check Prometheus logs
- Verify service discovery config
- Test metrics endpoint manually
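"Test metrics endpoint manually" usually means fetching it yourself and checking that it returns Prometheus text format rather than an error page or a login redirect. A rough sketch, with hypothetical function names; the URL in the comment assumes a Spring Boot service on localhost with the Actuator Prometheus endpoint exposed.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """GET the metrics endpoint, as a manual check with curl would."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sample_names(exposition_text):
    """Return the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample line looks like: name{labels} value [timestamp]
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Example (not executed here; the URL is an assumption):
# names = sample_names(fetch_metrics("http://localhost:8080/actuator/prometheus"))
# print(sorted(names)[:10])
```

If the fetch works from your machine but the target is still down in Prometheus, the problem is more likely network rules or service discovery than the application itself.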
Grafana dashboard not updating - Debug
- Check data source connection
- Verify time range selection
- Check query syntax
- Review Prometheus/data source availability
- Check dashboard refresh settings
- Look for query errors in browser console
API Gateway returning 5xx errors - Investigation
- Check gateway logs
- Verify backend service health
- Check timeout configurations
- Review rate limiting rules
- Check authentication/authorization
- Verify routing configuration

Interview Preparation Tips

Common Interview Patterns

Pattern 1: Troubleshooting Scenarios

Always start with data/metrics
Follow systematic approach
Consider recent changes
Think about dependencies
Propose monitoring improvements

Pattern 2: System Design Questions

Define requirements first
Consider scale and load
Plan for failure scenarios
Include monitoring from start
Discuss trade-offs

Pattern 3: Tool Selection

Understand use case
Consider scale and cost
Think about team expertise
Integration with existing tools
Open-source vs commercial

Key Concepts to Master

Metrics Collection: Pull vs Push, sampling, cardinality
Log Aggregation: Centralization, parsing, storage, retention
Distributed Tracing: Correlation, context propagation, sampling
Alerting: Thresholds, alert fatigue, actionable alerts
Scalability: Horizontal scaling, partitioning, caching
High Availability: Replication, failover, disaster recovery

Quick Reference Metrics

Application:

Response time (p50, p95, p99)
Request rate (req/sec)
Error rate (%)
Active users/connections
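A short sketch of how the percentile and error-rate figures above are derived from raw samples, using the nearest-rank definition of a percentile. The function names are hypothetical and the sample latencies are made up; note that Prometheus's `histogram_quantile` estimates quantiles from bucket counts rather than from raw samples like this.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]


def error_rate_percent(errors, total):
    """Share of failed requests as a percentage; 0.0 when there was no traffic."""
    return 100.0 * errors / total if total else 0.0


# Made-up response times in milliseconds.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # prints "50 100"
print(error_rate_percent(5, 200))  # prints "2.5"
```

With only ten samples, p95 and p99 both land on the slowest request, which is why tail percentiles are only meaningful over a reasonably large window of traffic.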

Infrastructure:

CPU utilization (%)
Memory usage (%)
Disk I/O (IOPS, throughput)
Network I/O (bytes/sec)

Database:

Query execution time
Connection pool usage
Slow query count
Replication lag

Cache (Redis):

Hit rate (%)
Memory usage
Evicted keys
Connected clients

Message Queue (Kafka):

Consumer lag
Message rate
Under-replicated partitions
Broker availability
